feat(gpu-crud): record gpu_mode and MIG strategy on every node-pool op by xuexu6666 · Pull Request #1233 · Azure/telescope

xuexu6666 · 2026-06-24T22:57:49Z

Problem

GPU CRUD benchmark records could not reliably distinguish managed vs fully-managed GPU pools. The stable azure-mgmt-containerservice SDK does not model gpuProfile.nvidia.managementMode, so a fully-managed pool's mode is silently dropped from the nodepool_info read-back — leaving only the pool name (a brittle string) to tell them apart. On top of that:

- MIG single vs mixed was dropped on scale operations.
- scale_node_pool never recorded the managed flag at all.

What this does

Derive a normalized set of GPU metadata from the operation input flags (not the lossy AKS read-back) and attach it to create, scale, and progressive-scale operations:

| field | values |
| --- | --- |
| `gpu_mode` | `none` \| `managed` \| `fully_managed` |
| `enable_managed_gpu` | normalized fully-managed flag |
| `mig_enabled` | bool |
| `gpu_instance_profile` | e.g. `MIG1g` \| `null` |
| `gpu_mig_strategy` | `single` \| `mixed` \| `null` |

- Thread gpu_mig_strategy through the scale path (main.py → NodePoolCRUD → AKSClient → _progressive_scale) and through execute.yml so scale-up/scale-down records are accurate.
- Echo the metadata to the console on each op (GPU pool metadata: gpu_mode=… mig_enabled=… …) so it is visible in pipeline logs, not only in the uploaded results.json.
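
A minimal sketch of how the console echo could be formatted, assuming the metadata is a plain dict keyed by the five field names above. The function name `format_gpu_metadata_echo` is hypothetical and is not taken from the PR; the line format follows the example quoted in the Verification section, and the "skip non-GPU pools" behavior follows the commit message for the echo change.

```python
from typing import Any, Dict, Optional

# Field order as shown in the console line quoted in the PR description.
ECHO_FIELDS = (
    "gpu_mode",
    "enable_managed_gpu",
    "mig_enabled",
    "gpu_instance_profile",
    "gpu_mig_strategy",
)


def format_gpu_metadata_echo(metadata: Dict[str, Any]) -> Optional[str]:
    """Return the console line for a GPU pool, or None for a non-GPU pool.

    Hypothetical helper for illustration; not the PR's actual implementation.
    """
    if metadata.get("gpu_mode", "none") == "none":
        return None  # nothing to echo for non-GPU pools
    pairs = " ".join(f"{name}={metadata.get(name)}" for name in ECHO_FIELDS)
    return f"GPU pool metadata: {pairs}"


if __name__ == "__main__":
    line = format_gpu_metadata_echo(
        {
            "gpu_mode": "managed",
            "enable_managed_gpu": False,
            "mig_enabled": False,
            "gpu_instance_profile": None,
            "gpu_mig_strategy": None,
        }
    )
    if line is not None:
        print(line)
```

Returning the string rather than printing it keeps the formatting testable independently of whichever logger the pipeline uses.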

Robustness / review hardening

- gpu_instance_profile / gpu_mig_strategy are Azure-only MIG inputs; the AWS CRUD does not accept them (no **kwargs). They are now only forwarded when --cloud azure, avoiding a TypeError on AWS runs.
- _gpu_mode_metadata normalizes flag combinations so records are always internally consistent: managed/MIG only apply to a GPU pool; MIG only applies to fully-managed pools (dropped otherwise); invalid gpu_mig_strategy is rejected.
- execute.yml gates --gpu-instance-profile / --gpu-mig-strategy on ENABLE_MANAGED_GPU==true (create, scale-up, scale-down) so MIG flags are only emitted for fully-managed pools.
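
The normalization rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `_gpu_mode_metadata`: the function name, the `gpu_node_pool` input flag, and the rule that a strategy without an instance profile is cleared are all hypothetical. The output values mirror the rows in the Verification table.

```python
from typing import Any, Dict, Optional

VALID_MIG_STRATEGIES = ("single", "mixed")


def gpu_mode_metadata_sketch(
    gpu_node_pool: bool,  # assumed input flag: is this a GPU pool at all?
    enable_managed_gpu: bool = False,
    gpu_instance_profile: Optional[str] = None,
    gpu_mig_strategy: Optional[str] = None,
) -> Dict[str, Any]:
    """Derive normalized GPU metadata from operation input flags (sketch)."""
    # Invalid strategies are rejected rather than silently recorded.
    if gpu_mig_strategy is not None and gpu_mig_strategy not in VALID_MIG_STRATEGIES:
        raise ValueError(f"invalid gpu_mig_strategy: {gpu_mig_strategy!r}")

    # The managed flag only applies to a GPU pool.
    fully_managed = bool(gpu_node_pool and enable_managed_gpu)
    if not gpu_node_pool:
        gpu_mode = "none"
    elif fully_managed:
        gpu_mode = "fully_managed"
    else:
        gpu_mode = "managed"

    # MIG only applies to fully-managed pools; drop it otherwise.
    if not fully_managed:
        gpu_instance_profile = None
        gpu_mig_strategy = None

    mig_enabled = gpu_instance_profile is not None
    if not mig_enabled:
        # Assumption: a strategy without an instance profile is meaningless.
        gpu_mig_strategy = None

    return {
        "gpu_mode": gpu_mode,
        "enable_managed_gpu": fully_managed,
        "mig_enabled": mig_enabled,
        "gpu_instance_profile": gpu_instance_profile,
        "gpu_mig_strategy": gpu_mig_strategy,
    }


if __name__ == "__main__":
    print(gpu_mode_metadata_sketch(True, True, "MIG1g", "mixed"))
```

Deriving every field from the inputs in one place is what keeps the record internally consistent: no caller can persist a MIG strategy on a pool whose mode is `managed` or `none`.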

Incidental fixes (folded in)

- Scale poller timeout 20m → 30m (_begin_update_with_retry): A100 MIG node provisioning could exceed the old 1200s timeout, causing spurious scale failures.
- verify_nvidia_smi_on_node MIG-single race: the readiness wait broke as soon as the nvidia.com/gpu key appeared, even with value 0. A freshly-scaled MIG-single node registers nvidia.com/gpu=0 before publishing its MIG instances, so the check raced in, read 0, and skipped the node with a misleading "has no GPUs" warning. Now waits for a positive GPU/MIG count.
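
The readiness fix can be illustrated with a small polling loop that treats a zero count the same as a missing key. This is a sketch, not the code in verify_nvidia_smi_on_node: the function names, the `read_allocatable` callable, the default timeout and poll interval, and the `nvidia.com/mig-` resource prefix used to count MIG slices are assumptions made for the example.

```python
import time
from typing import Callable, Mapping

GPU_RESOURCE = "nvidia.com/gpu"
MIG_RESOURCE_PREFIX = "nvidia.com/mig-"  # assumed naming for MIG slice resources


def count_gpu_resources(allocatable: Mapping[str, str]) -> int:
    """Sum whole-GPU and MIG-slice quantities advertised by a node."""
    total = 0
    for name, quantity in allocatable.items():
        if name == GPU_RESOURCE or name.startswith(MIG_RESOURCE_PREFIX):
            try:
                total += int(quantity)
            except (TypeError, ValueError):
                continue  # ignore quantities that are not plain integers
    return total


def wait_for_positive_gpu_count(
    read_allocatable: Callable[[], Mapping[str, str]],
    timeout_seconds: float = 300.0,
    poll_interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll until the node advertises a positive count, or return 0 on timeout.

    Key presence alone is not enough: a value of 0 keeps the loop waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        count = count_gpu_resources(read_allocatable())
        if count > 0:
            return count
        if clock() >= deadline:
            return 0
        sleep(poll_interval_seconds)
```

A caller would skip the node only when the function returns 0, which now means "still no GPUs after the full wait" rather than "the key appeared with a transient zero".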

Verification

Validated end-to-end on the GPU Cluster CRUD pipeline: the latest run, build 71637 (commit e07404fa, succeeded), exercised the A100 chain managed → fully_managed → MIG-mixed → MIG-single.

Recorded metadata in the published results.json is correct on every create/scale_up/scale_down op (confirmed across builds 71550/71623/71637):

| stage | gpu_mode | enable_managed_gpu | mig_enabled | gpu_instance_profile | gpu_mig_strategy |
| --- | --- | --- | --- | --- | --- |
| managed | `managed` | `False` | `False` | `None` | `None` |
| fully_managed | `fully_managed` | `True` | `False` | `None` | `None` |
| MIG mixed | `fully_managed` | `True` | `True` | `MIG1g` | `mixed` |
| MIG single | `fully_managed` | `True` | `True` | `MIG1g` | `single` |

Also confirmed in build 71637 logs:

- Console echo prints per op, e.g. GPU pool metadata: gpu_mode=managed enable_managed_gpu=False mig_enabled=False gpu_instance_profile=None gpu_mig_strategy=None.
- The MIG-single nvidia.com/gpu=0 race is fixed: a100migsin is no longer skipped.
- No scale-poller timeout (30-min bump applied).

managed vs fully_managed are now distinguishable, and MIG mixed vs single is recorded on scale ops as well as create.

Known / out-of-scope (not introduced by this PR)

Observed in build 71637 but unrelated to these changes:

- A100 vCPU quota exhaustion in southcentralus (ErrCode_InsufficientVCPUQuota, requested 192 / remaining 0) during an a100migmixed scale-up. This is environmental capacity, not a code issue.
- Managed-pool GPU advertisement: a100managed (driver-install) nodes did not advertise nvidia.com/gpu within the wait window, so nvidia-smi verification skipped them. This is pre-existing behavior (the prior code also waited and skipped, since the key never appears): a device-plugin scheduling/timing issue on the managed A100 path, worth a separate follow-up.

Testing

- Unit tests for the metadata helper variants, normalization (MIG dropped for non-fully-managed, managed flag normalized without a GPU pool, invalid strategy rejected), scale-op metadata persistence, console echo, an AWS regression test asserting MIG kwargs are omitted for --cloud aws, and a verify_nvidia_smi_on_node regression test (node reports 0 then a positive count → verified, not skipped).
- End-to-end validated via build 71637 (tables above).

GPU CRUD benchmark records could not reliably distinguish managed vs fully-managed GPU pools: the stable azure-mgmt-containerservice SDK does not model gpuProfile.nvidia.managementMode, so a fully-managed pool's mode is silently dropped from the nodepool_info read-back, leaving only the pool name (a brittle string) to tell them apart. MIG single vs mixed was also dropped on scale ops, and scale_node_pool never even recorded the managed flag.

Derive a normalized set of GPU metadata from the operation INPUT flags and attach it to create, scale, and progressive-scale operations:

- gpu_mode: "none" | "managed" | "fully_managed"
- enable_managed_gpu, mig_enabled
- gpu_instance_profile, gpu_mig_strategy ("single" | "mixed")

Thread gpu_mig_strategy through the scale path (main.py -> NodePoolCRUD -> AKSClient -> _progressive_scale), and pass the scale-path GPU flags through execute.yml so scale-up/scale-down records are accurate in the pipeline.

Add unit tests for the metadata helper variants and for scale ops persisting gpu_mode + MIG fields.

Copilot

Pull request overview

This PR improves the fidelity of GPU CRUD benchmark records by deriving and persisting normalized GPU metadata (managed vs fully-managed mode and MIG configuration) from operation input flags, rather than relying on AKS SDK read-back fields that may be missing.

Changes:

- Add a normalized GPU metadata helper (gpu_mode, MIG fields) and attach it to create/scale/progressive-scale operation metadata.
- Thread gpu_mig_strategy through the Azure scale call chain (main.py → NodePoolCRUD → AKSClient → _progressive_scale).
- Update the pipeline execution step to pass MIG strategy on scale operations, and add unit tests covering metadata normalization and scale metadata persistence.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| steps/engine/crud/k8s/execute.yml | Passes additional GPU/MIG CLI flags into create/scale steps so pipeline records can include them. |
| modules/python/tests/clients/test_aks_client.py | Adds unit tests for GPU metadata normalization and scale operation metadata persistence. |
| modules/python/crud/main.py | Threads `gpu_mig_strategy` into scale/all kwargs so the CLI can pass it into CRUD implementations. |
| modules/python/crud/azure/node_pool_crud.py | Extends Azure NodePoolCRUD signatures/forwarding to include `gpu_mig_strategy`. |
| modules/python/clients/aks_client.py | Introduces `_gpu_mode_metadata` and attaches it to operation metadata across create/scale/progressive-scale. |

…t GPU metadata

- main.py: gpu_instance_profile/gpu_mig_strategy are Azure-only MIG inputs; the AWS CRUD does not accept them (no **kwargs) and would raise TypeError. Only forward them when --cloud azure, via a conditional azure_gpu_kwargs dict.
- aks_client._gpu_mode_metadata: normalize flag combinations so records are always internally consistent: enable_managed_gpu/MIG only apply to a GPU pool; MIG only applies to fully-managed pools (dropped otherwise); reject invalid gpu_mig_strategy values.
- execute.yml: gate --gpu-instance-profile/--gpu-mig-strategy on ENABLE_MANAGED_GPU==true (create, scale-up, scale-down) so MIG flags are only emitted for fully-managed pools.
- tests: set cloud="azure" on the GPU-kwarg cases, add an AWS regression test asserting MIG kwargs are omitted, and cover the new metadata normalization.
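
A rough illustration of the conditional-kwargs pattern described for main.py. Only the `azure_gpu_kwargs` name comes from the commit message; `build_azure_gpu_kwargs` and `aws_scale_node_pool` are hypothetical stand-ins, the latter modelling an AWS CRUD method with a fixed signature and no **kwargs.

```python
from typing import Any, Dict, Optional


def build_azure_gpu_kwargs(
    cloud: str,
    gpu_instance_profile: Optional[str],
    gpu_mig_strategy: Optional[str],
) -> Dict[str, Any]:
    """Return MIG kwargs only for Azure; an empty dict for any other cloud."""
    if cloud != "azure":
        return {}
    return {
        "gpu_instance_profile": gpu_instance_profile,
        "gpu_mig_strategy": gpu_mig_strategy,
    }


def aws_scale_node_pool(node_pool_name: str, node_count: int) -> str:
    """Hypothetical AWS stand-in: fixed signature, so extra kwargs raise TypeError."""
    return f"{node_pool_name}:{node_count}"


if __name__ == "__main__":
    azure_gpu_kwargs = build_azure_gpu_kwargs("aws", "MIG1g", "single")
    # Unpacking an empty dict adds no arguments, so the AWS call is unaffected.
    print(aws_scale_node_pool("gpupool", 2, **azure_gpu_kwargs))
```

Building the dict once and unpacking it at each call site keeps the cloud check in one place instead of repeating it for scale-up, scale-down, and all.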

Log gpu_mode/enable_managed_gpu/mig_enabled/gpu_instance_profile/gpu_mig_strategy on each create/scale/progressive-scale op so the GPU mode is visible directly in pipeline logs (not only in the uploaded results.json). Skipped for non-GPU pools.

…n nvidia-smi verify

verify_nvidia_smi_on_node broke its readiness wait as soon as the nvidia.com/gpu key appeared, even with value 0. On a freshly-scaled MIG-single node the device plugin registers nvidia.com/gpu=0 before publishing the MIG instances, so the check raced in, read 0, and skipped the node with a misleading 'has no GPUs' warning. Wait for a positive GPU or MIG slice count instead of mere key presence.

Tests: update no-GPU test for the new wait semantics; add a regression test where a node reports 0 then a positive count and must be verified (not skipped).

Copilot AI review requested due to automatic review settings June 24, 2026 22:57

xuexu6666 requested review from LeonardCareer, alyssa1303, anson627, liyu-ma, sumanthreddy29, vittoriasalim, wonderyl and xinWeiWei24 as code owners June 24, 2026 22:57

Copilot started reviewing on behalf of xuexu6666 June 24, 2026 22:58 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

Comment thread modules/python/crud/main.py

Comment thread modules/python/crud/main.py Outdated

Comment thread steps/engine/crud/k8s/execute.yml Outdated

Comment thread steps/engine/crud/k8s/execute.yml Outdated

Comment thread modules/python/clients/aks_client.py Outdated

xuexu6666 added 8 commits June 24, 2026 18:24

test: update scale/all call assertions for new gpu_mig_strategy kwarg

2adf27a

Existing exact-call assertions in test_main.py and test_azure_node_pool_crud.py broke because scale_node_pool / all now receive gpu_mig_strategy. Add the new kwarg to the expected calls.

test: silence pylint protected-access for _gpu_mode_metadata unit test

b9e29c9

The helper is intentionally tested directly; disable W0212 for that test method.

test: alias _gpu_mode_metadata to a local to satisfy pylint protected…

3e07a2d

…-access

The def-line block disable did not suppress W0212 in CI; bind the protected helper to a single inline-disabled local and call that instead.

style: trim aks_client docstring to stay under pylint max-module-line…

f51d30a

…s (C0302)

The console-echo helper pushed the module to 1014 lines (limit 1000). Condense the _gpu_mode_metadata docstring and the gpu_mig_strategy ValueError message.

fix(gpu-crud): raise node-pool scale poller timeout 20m -> 30m

2b93c03

A100 MIG-single node provisioning can exceed the previous 1200s (20 min) scale poller timeout, causing spurious scale failures. Bump _begin_update_with_retry default timeout to 1800s (30 min).

xuexu6666 wants to merge 9 commits into main from xuxue/gpu-mode-mig-metadata

2 participants
